In this notebook, I explore a dataset of sneaker orders from 100 sneaker shops. The goal is to find the reason behind an unusually high average order value (AOV) of $3145.13 over a 30-day window, and then to propose a metric that represents this dataset more accurately.
I will be answering the following questions: Think about a better way to evaluate this data. What metric would you report for this dataset? What is its value?
First, let's read the dataset. I downloaded it as a CSV file and used the Pandas library to load it into a dataframe.
import pandas as pd
local_data_path = "./data/shopify-dataset.csv"
# Read dataset into a pandas dataframe
df_data = pd.read_csv(local_data_path)
df_data
| | order_id | shop_id | user_id | order_amount | total_items | payment_method | created_at |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 53 | 746 | 224 | 2 | cash | 2017-03-13 12:36:56 |
| 1 | 2 | 92 | 925 | 90 | 1 | cash | 2017-03-03 17:38:52 |
| 2 | 3 | 44 | 861 | 144 | 1 | cash | 2017-03-14 4:23:56 |
| 3 | 4 | 18 | 935 | 156 | 1 | credit_card | 2017-03-26 12:43:37 |
| 4 | 5 | 18 | 883 | 156 | 1 | credit_card | 2017-03-01 4:35:11 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 4995 | 4996 | 73 | 993 | 330 | 2 | debit | 2017-03-30 13:47:17 |
| 4996 | 4997 | 48 | 789 | 234 | 2 | cash | 2017-03-16 20:36:16 |
| 4997 | 4998 | 56 | 867 | 351 | 3 | cash | 2017-03-19 5:42:42 |
| 4998 | 4999 | 60 | 825 | 354 | 2 | credit_card | 2017-03-16 14:51:18 |
| 4999 | 5000 | 44 | 734 | 288 | 2 | debit | 2017-03-18 15:48:18 |
5000 rows × 7 columns
I generated a profile report of the dataframe using pandas_profiling, since the df.describe() function is too basic for this purpose.
# Note: in newer releases this package is named ydata-profiling
# (from ydata_profiling import ProfileReport)
from pandas_profiling import ProfileReport
profile_report = ProfileReport(df_data)
# Save the profile report
profile_report.to_file(output_file='./data/output_pandas_profiling.html')
profile_report
By looking at the profile report and analyzing each variable, I noticed something strange in the order_amount variable: the maximum value observed is 704,000 while the minimum is 90, a huge gap between the two extremes. I can also see the AOV as the mean of order_amount; since this dataset only contains orders from a single month, the AOV over a 30-day window equals the mean of order_amount. The total_items variable shows a similarly large gap between its extremes, with a minimum of 1 and a maximum of 2000 items.
By looking at the warnings generated by the profile report, I noticed a high correlation between the variables I mentioned. Indeed, the correlation matrices, such as Pearson's, clearly show a strong correlation between order_amount and total_items. This is expected, since order_amount is the product of total_items and the unit price of each shoe, which is not included in this dataset.
Therefore, I decided to explore in more detail the order_amount variable. It will probably lead to the same conclusions as analyzing the total_items variable because of their strong correlation.
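On the real data, this correlation can be computed directly with df_data[['order_amount', 'total_items']].corr(). The self-contained sketch below uses invented unit prices (they are not in the dataset) to show how a multiplicative relationship, plus one bulk order resembling the 2000-item extreme, yields a near-perfect Pearson coefficient:

```python
import pandas as pd

# Hypothetical unit prices: order_amount = unit_price * total_items,
# including one bulk order resembling the 2000-item outlier.
toy = pd.DataFrame({
    "total_items": [1, 2, 3, 5, 2000],
    "unit_price": [150, 150, 150, 150, 352],
})
toy["order_amount"] = toy["unit_price"] * toy["total_items"]
print(toy[["order_amount", "total_items"]].corr(method="pearson"))
```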
Before exploring a dataset, it is always a good idea to check for missing values. Even though I could clearly see in the profile report that there are no missing values, I wanted to showcase a library I like using, missingno, to visualize how many missing values are present.
import missingno as mno
fig = mno.matrix(df_data)
fig_copy = fig.get_figure()
fig_copy.savefig('./data/matrix_missing_plot.png', bbox_inches = 'tight')
By looking at the previous diagram, we can see that there are no missing values in the dataset.
Here is an example of a dataset I worked with that had a lot of missing values.
The white lines represent the missing data points. We can clearly see a difference with the current shopify orders dataset that had a diagram with no missing data points.
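The visual check can also be confirmed numerically with pandas. The frame below is a hypothetical toy example with one missing value; on the real data the equivalent call would be df_data.isna().sum():

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value to show the numeric check
toy = pd.DataFrame({"a": [1, np.nan, 3], "b": [4, 5, 6]})
print(toy.isna().sum())  # column "a" has 1 missing value, "b" has 0
```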
In this section, I visualize the order_amount values in more detail.
import matplotlib.pyplot as plt
# Create a figure and axis
fig, ax = plt.subplots()
# Scatter the order_id against the order_amount
ax.scatter(df_data['order_id'], df_data['order_amount'])
# Set a title and labels
ax.set_title('order_amount vs. order_id ')
ax.set_xlabel('order_id')
ax.set_ylabel('order_amount')
Text(0, 0.5, 'order_amount')
As we can see, most orders have an order_amount well below 100,000; one data point sits around 150,000 and a handful around 700,000. I went for a scatter plot to represent the data as a collection of points. Most of the points are close to 0, causing them to form a horizontal line at the bottom of the plot.
# Create a figure and axis
fig, ax = plt.subplots()
ax.hist(df_data["order_amount"])
# Set title and labels
ax.set_title('order_amount vs. count')
ax.set_xlabel('Order_amount')
ax.set_ylabel('Count')
Text(0, 0.5, 'Count')
In this histogram, I plotted the order_amount values against the number of orders with each value. Almost all 5000 data points fall in the same low range; only a handful appear at order amounts around 100,000 and 700,000. These outliers may be the reason why the AOV is so high.
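A small synthetic illustration (the amounts below are invented stand-ins, not the real distribution) shows numerically what the histogram hints at: a handful of extreme orders drags the mean far above the median.

```python
import pandas as pd

# Mostly typical orders plus a few huge ones, mimicking the shape of the data
amounts = pd.Series([150] * 4990 + [700000] * 10)
print(f"Mean:   {amounts.mean():.2f}")    # 1549.70 -- pulled up by 10 orders
print(f"Median: {amounts.median():.2f}")  # 150.00  -- unaffected
```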
df_data.plot(kind='box', figsize=(8, 6))
plt.title('Box Plot of Variables')
plt.ylabel('Value')  # all numeric variables are plotted, not just order_amount
plt.show()
At this point, it is clear that there are outliers. But I wanted to generate a box plot to view the outliers of all the variables at once. I can clearly see the outliers for order_amount and for total_items.
import seaborn as sns
sns.boxplot(x=df_data['order_amount'])
<AxesSubplot:xlabel='order_amount'>
This is an individual representation of the outliers of order_amount using a box plot generated with Seaborn.
import seaborn as sns
sns.boxplot(x=df_data['total_items'])
<AxesSubplot:xlabel='total_items'>
For total_items, the only outliers all seem to share the same value, 2000.
df_data[df_data["order_amount"]==704000]
| | order_id | shop_id | user_id | order_amount | total_items | payment_method | created_at |
|---|---|---|---|---|---|---|---|
| 15 | 16 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-07 4:00:00 |
| 60 | 61 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-04 4:00:00 |
| 520 | 521 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-02 4:00:00 |
| 1104 | 1105 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-24 4:00:00 |
| 1362 | 1363 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-15 4:00:00 |
| 1436 | 1437 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-11 4:00:00 |
| 1562 | 1563 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-19 4:00:00 |
| 1602 | 1603 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-17 4:00:00 |
| 2153 | 2154 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-12 4:00:00 |
| 2297 | 2298 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-07 4:00:00 |
| 2835 | 2836 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-28 4:00:00 |
| 2969 | 2970 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-28 4:00:00 |
| 3332 | 3333 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-24 4:00:00 |
| 4056 | 4057 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-28 4:00:00 |
| 4646 | 4647 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-02 4:00:00 |
| 4868 | 4869 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-22 4:00:00 |
| 4882 | 4883 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-25 4:00:00 |
It is now clear that every order with an order_amount of 704,000 also has a total_items of 2000, which is the outlier value of the total_items variable.
This is an interactive scatter plot using Plotly. I can hover over each point to view its values. It is an alternative to the scatter plot generated with matplotlib.
import plotly.express as px
fig = px.scatter(df_data, x="order_id", y="order_amount", title="Order_amount based on Order_id")
fig.show()
import dash
# dash_core_components / dash_html_components are deprecated;
# in Dash >= 2.0 both modules live in the main package
from dash import dcc, html
from dash.dependencies import Input, Output
import plotly.express as px
app = dash.Dash(__name__)
app.layout = html.Div([
dcc.Graph(id="scatter-plot"),
html.P("Order Amount:"),
dcc.RangeSlider(
id='range-slider',
min=0, max=800000, step=1,
marks={0: '0', 800000: '800000'},
value=[0, 500]
),
])
@app.callback(
Output("scatter-plot", "figure"),
[Input("range-slider", "value")])
def update_scatter_plot(slider_range):
low, high = slider_range
mask = (df_data['order_amount'] > low) & (df_data['order_amount'] < high)
fig = px.scatter(
df_data[mask], x="order_id", y="order_amount",
color="order_amount", size='order_amount',
hover_data=['order_amount'])
return fig
app.run_server(debug=False)
Dash is running on http://127.0.0.1:8050/ * Serving Flask app "__main__" (lazy loading) * Environment: production WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead. * Debug mode: off
* Running on http://127.0.0.1:8050/ (Press CTRL+C to quit)
from IPython.display import Video
Video("Dash.mp4", embed=True)
In this video, I show the usage of a scatter plot I generated using Dash. It is interactive and very user friendly. Each data point is sized in proportion to its order amount, and the order_amount is also encoded as the color of each dot. The range slider can be adjusted to get a better view of data points with smaller or larger order amounts.
By looking at the video, we can see the difference in proportion between most of the data points at the bottom and the outliers with an order amount above 700k. Indeed, I had to reduce the range significantly to view the majority of the data points.
Let's see how many orders have an order amount equal to 704,000.
df_data[df_data["order_amount"]==704000].shape[0]
17
There are 17 orders with this outlier amount. All of them were paid by credit card, placed by the same user (id 607) at the same shop (id 42), each for exactly 2000 items, and always created at the same time, 4:00:00, on different days of March 2017. An order of 2000 items is far too large to be a normal purchase; a regular customer rarely buys more than five pairs of shoes.
Since 17 orders with such a large amount were made in the same month, this explains why the average order value is $3145.13. Indeed, the profile report shows this value as the mean of the order_amount column.
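A quick back-of-the-envelope check shows how much these 17 orders weigh in the total (figures from the notebook: 5000 orders, mean order amount 3145.128):

```python
bulk_total = 17 * 704000        # revenue from the 17 bulk orders
grand_total = 3145.128 * 5000   # mean order amount * number of orders
print(f"Bulk orders are {bulk_total / grand_total:.1%} of total revenue")  # 76.1%
```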
At this point, I detected the outliers using mostly visual representations. I decided to explore more statistical ways to do so.
The Z-score expresses the relationship between a data point (observation) and the mean and standard deviation of the dataset. It rescales and centers the data so that it has a mean of 0 and a standard deviation of 1, i.e. a standard normal distribution, which makes it easy to spot data points that lie far from zero.
If the absolute Z-score of a data point is greater than a threshold of 3, then the data point is considered an outlier.
import numpy as np
def retrieve_outliers(column_data, threshold=3):
    # Keep the list local so repeated calls don't accumulate old results
    outliers = []
    mean = np.mean(column_data)
    std = np.std(column_data)
    print("Mean: " + str(mean))
    print("Standard deviation: " + str(std))
    for observation in column_data:
        z_score = (observation - mean) / std
        if np.abs(z_score) > threshold:
            outliers.append(observation)
    return outliers
outliers_order_amount = retrieve_outliers(df_data["order_amount"])
print("Outliers of the column order_amount are: " + str(outliers_order_amount))
print("There are " + str(len(outliers_order_amount)) + " outliers.")
Mean: 3145.128 Standard deviation: 41278.41088841982 Outliers of the column order_amount are: [704000, 704000, 704000, 154350, 704000, 704000, 704000, 704000, 704000, 704000, 704000, 704000, 704000, 704000, 704000, 704000, 704000, 704000] There are 18 outliers.
As detected before, there are 17 outliers with an order_amount of 704,000, plus one value of 154,350.
The IQR method underlies box plots. It measures statistical dispersion, so it is similar to the variance and the standard deviation, but it has the advantage of not being affected by the presence of outliers.
To obtain the IQR, I subtract the first quartile (the 25th percentile) from the third quartile (the 75th percentile).
# The 25th percentile (numeric_only restricts the computation to numeric
# columns, which newer pandas versions no longer do silently)
Q1 = df_data.quantile(0.25, numeric_only=True)
print("The 25th percentile: " + str(Q1))
# The 75th percentile
Q3 = df_data.quantile(0.75, numeric_only=True)
print("The 75th percentile: " + str(Q3))
# The interquartile range
IQR = Q3 - Q1
print("IQR: " + str(IQR))
The 25th percentile: order_id 1250.75 shop_id 24.00 user_id 775.00 order_amount 163.00 total_items 1.00 Name: 0.25, dtype: float64 The 75th percentile: order_id 3750.25 shop_id 75.00 user_id 925.00 order_amount 390.00 total_items 3.00 Name: 0.75, dtype: float64 IQR: order_id 2499.5 shop_id 51.0 user_id 150.0 order_amount 227.0 total_items 2.0 dtype: float64
is_outlier_df = (df_data < (Q1 - 1.5 * IQR)) | (df_data > (Q3 + 1.5 * IQR))
is_outlier_df.loc[is_outlier_df['order_amount'] == True]
| | created_at | order_amount | order_id | payment_method | shop_id | total_items | user_id |
|---|---|---|---|---|---|---|---|
| 15 | False | True | False | False | False | True | False |
| 60 | False | True | False | False | False | True | False |
| 99 | False | True | False | False | False | False | False |
| 136 | False | True | False | False | False | False | False |
| 160 | False | True | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 4918 | False | True | False | False | False | False | False |
| 4927 | False | True | False | False | False | False | False |
| 4952 | False | True | False | False | False | False | False |
| 4958 | False | True | False | False | False | False | False |
| 4980 | False | True | False | False | False | False | False |
141 rows × 7 columns
def count_outliers(dataframe):
for col in dataframe.columns:
print("Column " + str(col) + " has " + str(dataframe[col].sum()) + " outliers.")
count_outliers(is_outlier_df)
Column created_at has 0 outliers. Column order_amount has 141 outliers. Column order_id has 0 outliers. Column payment_method has 0 outliers. Column shop_id has 0 outliers. Column total_items has 18 outliers. Column user_id has 0 outliers.
df_outliers = df_data[((df_data < (Q1 - 1.5 * IQR)) | (df_data > (Q3 + 1.5 * IQR))).any(axis=1)]
df_outliers
| | order_id | shop_id | user_id | order_amount | total_items | payment_method | created_at |
|---|---|---|---|---|---|---|---|
| 15 | 16 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-07 4:00:00 |
| 60 | 61 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-04 4:00:00 |
| 99 | 100 | 18 | 752 | 780 | 5 | cash | 2017-03-06 23:41:16 |
| 136 | 137 | 15 | 961 | 765 | 5 | credit_card | 2017-03-26 5:06:46 |
| 160 | 161 | 78 | 990 | 25725 | 1 | credit_card | 2017-03-12 5:56:57 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 4918 | 4919 | 78 | 823 | 25725 | 1 | cash | 2017-03-15 13:26:46 |
| 4927 | 4928 | 74 | 849 | 765 | 5 | cash | 2017-03-28 6:55:57 |
| 4952 | 4953 | 26 | 786 | 880 | 5 | cash | 2017-03-17 1:50:18 |
| 4958 | 4959 | 70 | 711 | 865 | 5 | credit_card | 2017-03-08 17:22:51 |
| 4980 | 4981 | 50 | 942 | 772 | 4 | credit_card | 2017-03-23 5:15:19 |
141 rows × 7 columns
As we can see, this method detected many more outliers: 141 for order_amount.
fig = px.scatter(df_outliers, x="order_id", y="order_amount", title="Order_amount based on Order_id")
fig.show()
By looking at this scatter plot, we see that the IQR method identified every data point with an order_amount greater than 730.5 as an outlier. This is the upper fence for the order_amount variable: Q3 + 1.5 × IQR = 390.0 + 1.5 × 227.0 = 730.5.
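The fence can be recomputed from the quartiles printed earlier:

```python
# Upper outlier fence for order_amount (quartiles from the output above)
Q1, Q3 = 163.0, 390.0
IQR = Q3 - Q1
print(Q3 + 1.5 * IQR)  # -> 730.5
```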
If I look at the total_items variable, it has 18 outliers. Let's take a closer look at their values. The threshold is Q3 + 1.5 × IQR = 3 + 1.5 × 2 = 6, so every data point with total_items greater than 6 is considered an outlier.
df_outliers[df_outliers["total_items"] > 6]
| | order_id | shop_id | user_id | order_amount | total_items | payment_method | created_at |
|---|---|---|---|---|---|---|---|
| 15 | 16 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-07 4:00:00 |
| 60 | 61 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-04 4:00:00 |
| 520 | 521 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-02 4:00:00 |
| 1104 | 1105 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-24 4:00:00 |
| 1362 | 1363 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-15 4:00:00 |
| 1436 | 1437 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-11 4:00:00 |
| 1562 | 1563 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-19 4:00:00 |
| 1602 | 1603 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-17 4:00:00 |
| 2153 | 2154 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-12 4:00:00 |
| 2297 | 2298 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-07 4:00:00 |
| 2835 | 2836 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-28 4:00:00 |
| 2969 | 2970 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-28 4:00:00 |
| 3332 | 3333 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-24 4:00:00 |
| 4056 | 4057 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-28 4:00:00 |
| 4141 | 4142 | 54 | 733 | 1064 | 8 | debit | 2017-03-07 17:05:18 |
| 4646 | 4647 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-02 4:00:00 |
| 4868 | 4869 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-22 4:00:00 |
| 4882 | 4883 | 42 | 607 | 704000 | 2000 | credit_card | 2017-03-25 4:00:00 |
These rows correspond to the outliers detected before, which are probably the reason for the high AOV. As we can see, 17 of the 18 rows have an order_amount of 704,000 and a total_items of 2000.
We can see that the mean is affected by these outliers.
df_data["order_amount"].mean()
3145.128
df_no_outliers = df_data[df_data["order_amount"] != 704000]
df_no_outliers["order_amount"].mean()
754.0919125025085
Clearly, the mean is not an appropriate metric since it is heavily skewed by the 17 orders of $704,000. Without those outliers, the mean drops to $754.09, less than a quarter of the original value of $3145.13.
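The drop in the mean can be verified arithmetically from the totals alone:

```python
# Remove the 17 bulk orders of $704,000 from the grand total, then re-average
grand_total = 3145.128 * 5000                     # sum of all 5000 order amounts
adjusted_mean = (grand_total - 17 * 704000) / (5000 - 17)
print(round(adjusted_mean, 2))  # -> 754.09
```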
df_data["order_amount"].mode()
0 153 dtype: int64
df_no_outliers["order_amount"].mode()
0 153 dtype: int64
df_data["order_amount"].median()
284.0
df_no_outliers["order_amount"].median()
284.0
Mean, median and mode are all measures of central tendency, but the mean is the only one affected by the outliers here. As we saw, the mean was more than four times larger when the outliers were included, while neither the mode nor the median changed.
One disadvantage of the mode is that if most values are distinct while an extreme value repeats, the mode takes that extreme value. This is why I decided to ignore this metric.
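A hypothetical series makes the pitfall concrete (the values below are invented: distinct typical amounts plus one repeated extreme value):

```python
import pandas as pd

# Distinct typical values, plus an extreme value that happens to repeat
s = pd.Series([101, 102, 103, 104, 704000, 704000])
print(s.mode().iloc[0])  # 704000 -- the mode latches onto the extreme value
print(s.median())        # 103.5  -- the median stays central
```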
We can clearly see that the median and the mode are not affected by the outliers: I calculated both metrics before and after removing the outliers and their values are identical, which is not the case for the mean. The mode is 153 and the median is 284, but I explained above why I discard the mode. The IQR is also a candidate, since it is a robust measure of dispersion in the presence of outliers. To summarize, either of the following two metrics can be reported as AOV instead of the mean.
a. IQR = 227.0
b. Median = 284.0
For this question, the number of orders is what interests us. Therefore, I am going to use COUNT(), a function that returns the number of rows matching a specified criterion. This function will be applied to OrderID, since it is unique and serves as the primary key.
To access the shipper's name, I have to access another table, the Shippers table. I need to find the common information that links the Orders table and the Shippers table. By looking at both tables, I can see that this common information is the ShipperID.
By using the INNER JOIN keyword, the query selects records that have matching values in both tables.
Finally, I specify the condition using the WHERE clause to filter the records; in this case, the ShipperName must be "Speedy Express". Note that I am using aliases to specify which table each column belongs to, avoiding any ambiguity.
SELECT COUNT(o.OrderID) FROM Orders AS o
INNER JOIN Shippers AS s ON o.ShipperID = s.ShipperID
WHERE s.ShipperName = 'Speedy Express'
The result of this SQL query indicates that 54 orders were shipped by Speedy Express.
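To sanity-check the join-and-count pattern end to end, here is a minimal reproduction with Python's built-in sqlite3 module on two toy tables. The table names mirror the Northwind-style schema assumed by the question, but the rows are invented, so the count here is 2 rather than the real 54:

```python
import sqlite3

# Toy tables reproducing only the columns the query touches
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Shippers (ShipperID INTEGER PRIMARY KEY, ShipperName TEXT);
CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, ShipperID INTEGER);
INSERT INTO Shippers VALUES (1, 'Speedy Express'), (2, 'United Package');
INSERT INTO Orders VALUES (10248, 1), (10249, 2), (10250, 1);
""")
(count,) = con.execute("""
    SELECT COUNT(o.OrderID) FROM Orders AS o
    INNER JOIN Shippers AS s ON o.ShipperID = s.ShipperID
    WHERE s.ShipperName = 'Speedy Express'
""").fetchone()
print(count)  # -> 2
```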
In this case, I want the last name of the employee, a column in the Employees table, together with the number of orders per employee, which requires the Orders table. To link both tables, I use INNER JOIN on the employee ID. To aggregate all orders with the same employee ID, I use the GROUP BY statement, which groups rows sharing the same value, here the EmployeeID, into summary rows.
Finally, to find the employee with the most orders, I apply the MAX() function to the order count total_count and retrieve the LastName associated with that value.
SELECT LastName, MAX(total_count)
FROM (SELECT LastName, COUNT(*) AS total_count
      FROM Employees AS e
      INNER JOIN Orders AS o ON e.EmployeeID = o.EmployeeID
      GROUP BY o.EmployeeID)
The result of this SQL query indicates that the last name of the employee with the most orders is Peacock with 40 orders.
In this case, I want the ProductName, which lives in the Products table, along with the number of times each product was ordered. "Ordered the most" can be interpreted in two ways.
1. If we take the number of orders containing the product, the query is as follows:
SELECT ProductName, MAX(total_orders)
FROM (SELECT ProductName, COUNT(*) AS total_orders, Country
      FROM Customers AS c
      INNER JOIN Orders AS o ON c.CustomerID = o.CustomerID
      INNER JOIN OrderDetails AS d ON o.OrderID = d.OrderID
      INNER JOIN Products AS p ON p.ProductID = d.ProductID
      WHERE c.Country = 'Germany'
      GROUP BY p.ProductID)
The result of this SQL query indicates that the product ordered the most by customers in Germany is Gorgonzola Telino, ordered 5 times.
2. If we instead take the total quantity of the product ordered across all orders, the query is as follows:
SELECT ProductName, MAX(total_quantity)
FROM (SELECT ProductName, SUM(Quantity) AS total_quantity, Country
      FROM Customers AS c
      INNER JOIN Orders AS o ON c.CustomerID = o.CustomerID
      INNER JOIN OrderDetails AS d ON o.OrderID = d.OrderID
      INNER JOIN Products AS p ON p.ProductID = d.ProductID
      WHERE c.Country = 'Germany'
      GROUP BY p.ProductID)
In this case, the product ordered the most is Boston Crab Meat, with a total quantity of 160.
Both of these queries are very similar. For the first query, I counted the number of orders containing each product using COUNT(). To link the Customers table with the Orders table, I used INNER JOIN on CustomerID. Then, to reach the product information, I joined the Orders table to the OrderDetails table on OrderID, and OrderDetails to the Products table on ProductID, again with INNER JOIN. To keep only the rows with the country Germany, I used the WHERE clause. I ended the query with a GROUP BY statement to merge all rows with the same ProductID.
The second query is the same but instead of going for the number of orders, I requested the SUM() of the Quantity column.